Teradata: OLAP Functions

OLAP or On Line Analytical Processing functions are knows as “Ordered” analytical functions in Teradata environment. These functions provides support for many decision support operations in data mining environments that require an ordered set of results set or depend on values from a previous row. These functions are categorized as follows:

Window aggregate functions
Rank function
Distribution function
Row number function
Teradata SQL-specific functions

Advantages of Analytical Functions: Why someone would like to use analytical functions without making any sense of it, let’s discuss the reasons.

Faster Processing: – As these functions are database specific and processed by query execution engine, processing takes lesser time than in case of using external sorting mechanism.
Elimination of the need to use external tools which might need exporting large data sets to another environment. Thus coding effort is less too.

Now let’s see which function is used for what. Though most of these functions are aggregate functions, later you would see the usage is different.

SUM/CSUM :– These functions are used to compute a cumulative sum of a particular group of rows. SUM also can be used to simply calculate group sum. For moving sum use MSUM.
COUNT : – To calculate cumulative or moving count.
AVG : – Similarly to compute the moving average use the AVG or MAVG function.
MDIFF : – To see the difference between the current row (column) and the preceding nth row (column) value. If you want to see the sales numbers (increasing or decreasing) on a daily basis, use this function.
MLINREG : – To project the next value in a series based on the data pattern present in the series.
QUANTILE : – To divide the result set into partitions with equal number of rows present in each partition.
RANK : – This function is used to display the ordered rank of all rows in a particular group.
PERCENT_RANK:- To find out relative rank of a row in a group use PERCENT_RANK.
ROW_NUMBER:- To get the sequential row number of the row within its data subset.
MAX/MIN: – To calculate the maximum or minimum cumulative value in a group.

RANK ():- As the function name implies, RANK returns ranking (ordered) of rows based on the number or expression given in the ORDER BY clause. Note that we do not need to give anything inside the RANK function. Just give it to the ORDER BY clause and things would be taken care. Let’s work out a scenario.

Scenario is to find out the sales figure for each store based on their actual sales.

SELECT
store_no
,sale_month
,actual_sale,
RANK () OVER (PARTITION BY store_no ORDER BY actual_sale) as SaleRank
FROM sales;

Note that there is nothing being fed to the function RANK. Result set below shows the sales figure ranked, for each store.
      store_no sale_month    actual_sale    SaleRank
          10      05/01/2007    13,500           1
          10      01/01/2007    13,500           1
          20      02/01/2007    11,000           1
          20      04/01/2007    12,500           2
          30      02/01/2007    10,000           1
          30      03/01/2007    10,500           2
          30      03/01/2007    12,500           3
          40      02/01/2007    11,500           1
          40      04/01/2007    12,500           2
          50      01/01/2007    10,500           1
          50      05/01/2007    12,500           2
          60      06/01/2007    11,500           1
          70      07/01/2007    11,500           1
          80      07/01/2007    15,500           1
          90      06/01/2007    15,500           1

ROW_NUMBER ():- It returns the sequential number of the row within a group, starting with 1 based on the ORDER BY clause. With the above scenario if we replace the RANK with ROW_NUMBER and run the following query:

SELECT
store_no
,sale_month
,actual_sale,
ROW_NUMBER() OVER (PARTITION BY store_no ORDER BY actual_sale) as SaleRank
FROM sales;

The result set we get is similar, only difference is for first two rows we have 1,1 with RANK but 1,2 with ROW_NUMBER. This is because ROW_NUMBER takes into consideration the sequential occurrence of the row in a group which is store_no here.

store_no   sale_month    actual_sale    SaleRank
10            05/01/2007    13,500           1
10            01/01/2007    13,500           2
20            02/01/2007    11,000           1
20            04/01/2007    12,500           2
30            02/01/2007    10,000           1
30            03/01/2007    10,500           2
30            03/01/2007    12,500           3
40            02/01/2007    11,500           1
40            04/01/2007    12,500           2
50            01/01/2007    10,500           1
50            05/01/2007    12,500           2
60            06/01/2007    11,500           1
70            07/01/2007    11,500           1
80            07/01/2007    15,500           1
90            06/01/2007    15,500           1

COUNT ():- As you know when you use this function as aggregate function, it would just return the total number of rows present in the SELECT statement. But the same function when used with OLAP, returns the cumulative or moving count for an expression, let’s see how this is done. Let’s also use the same data set present in the initial article. Scenario is to find out, number of stores selling a particular product.

This is called grouping count and the same can be done in two ways one with taking store_no as count, partitioning by prod_code and the other ways to reverse the logic of taking prod_code as count and partition by store_no.

The following queries would give the desired result.

SELECT store_no, sale_month, actual_sale, prod_code,
COUNT (prod_code) OVER (PARTITION BY store_noROWS BETWEEN UNBOUNDED PRECEDING ANDUNBOUNDED FOLLOWING)
FROM sales;

SELECT store_no, sale_month, actual_sale, prod_code,
COUNT (store_no) OVER (PARTITION BY prod_codeROWS BETWEEN UNBOUNDED PRECEDING ANDUNBOUNDED FOLLOWING)
FROM sales;

store_no    sale_month    actual_sale    prod_code     Group Count(store_no)
10             01/01/2007    13,500          Apple                      2
60             06/01/2007    11,500          Apple                      2
10             05/01/2007    13,500          Apricot                   2
40             04/01/2007    12,500          Apricot                   2
70             07/01/2007    11,500          Banana                   2
20             02/01/2007    11,000          Banana                   2
50             05/01/2007    12,500          Grapes                    2
80             07/01/2007    15,500          Grapes                    2
50             01/01/2007    10,500          Guava                     2
40             02/01/2007    11,500          Guava                     2
30             03/01/2007    12,500          Mango                    2
20             04/01/2007    12,500          Mango                    2
30             03/01/2007    10,500          Orange                   2
90             06/01/2007    15,500          Orange                   2
30             02/01/2007    10,000          Strawberry              1

New Terms:-
ROWS BETWEEN: – This clause is to specify the start and end of the aggregation group.The default aggregation group is ROWS BETWEEN UNBOUNDED PRECEDING ANDUNBOUNDED FOLLOWING (take out this portion from the above query and run).
UNBOUNDED PROCEEDING: – When this is specified, the entire partition that precedes the current row is taken into the aggregation group.
UNBOUNDED FOLLOWING: – Used to define the entire partition that follows the current row.

SUM ():- When used with partition clause, SUM () function returns the cumulative or moving sum of an expression based on how the aggregation group is specified. This function can also be used in place of CSUM and MSUM which are Teradata-specific functions and are discouraged to a great extent. As per recent recommendations from Teradata, usage of ANSI-compliant window function for any new applications is advised. Lets see why Teradata suggests to go for ANSI-compliant SUM () function.
• If you are using SUM function and want to calculate the cumulative SUM, then just specify ORDER BY clause with ROWS UNBOUNDED PRECEDING which would give the same result as that of CSUM function.
• To compute moving average using SUM window function use ORDER BY clause and specify ROWS number PRECEDING (number of rows preceding the current row).
Let’s work out some examples using the data present in the previous section.
Scenario 1:- Calculate the cumulative actual sales per store ordered by sale month:
SELECT store_no, sale_month, actual_sale,
SUM (actual_sale) OVER (PARTITION BY store_no ORDER BY sale_month ROWS UNBOUNDED PRECEDING) as StoreTotal
FROM sales;ORDER BY store_no, sale_month;
Result Set:-
store_no   sale_month         actual_sale   StoreTotal
10         01/01/2007            13,500          13,500
10         05/01/2007            13,500          27,000
20         02/01/2007            11,000          11,000
20         04/01/2007            12,500          23,500
30         02/01/2007            10,000          10,000

30         03/01/2007            12,500          22,500
30         03/01/2007            10,500          33,000
40         02/01/2007            11,500          11,500
40         04/01/2007            12,500          24,000
50         01/01/2007            10,500          10,500
50         05/01/2007            12,500          23,000
60         06/01/2007            11,500          11,500
70         07/01/2007            11,500          11,500
80         07/01/2007            15,500          15,500
90         06/01/2007            15,500          15,500

Scenario 2:- Calculate the total actual sales of fruits by category per each store:
SELECT store_no, prod_code, actual_sale,
SUM (actual_sale) OVER (PARTITION BY prod_code ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as FruitTotal
FROM sales;
Result would be as follows:
store_no   prod_code   actual_sale   FruitTotal
60         Apple            11,500           25,000
10         Apple            13,500           25,000
40         Apricot          12,500           26,000
10         Apricot          13,500           26,000
20         Banana          11,000           22,500
70         Banana          11,500           22,500
50         Grapes           12,500           28,000
80         Grapes           15,500           28,000
50         Guava            10,500           22,000
40         Guava            11,500           22,000
20         Mango            12,500          25,000
30         Mango            12,500          25,000
90         Orange           15,500          26,000
30         Orange           10,500          26,000
30         Strawberry      10,000          10,000

Scenario 3:- Calculate the moving actual sales per month per store.
SELECT store_no, sale_month, actual_sale,
SUM (actual_sale) OVER (PARTITION BY store_no, sale_month ORDER BY actual_sale ROWS 1 PRECEDING) as SaleTrend
FROM sales;
And the Result would be:
store_no   sale_month         actual_sale   SaleTrend
10         01/01/2007           13,500          13,500
10         05/01/2007           13,500          13,500

20         02/01/2007           11,000          11,000
20         04/01/2007           12,500          12,500
30         02/01/2007           10,000          10,000
30         03/01/2007           10,500          10,500

30         03/01/2007           12,500          23,000
40         02/01/2007           11,500          11,500
40         04/01/2007           12,500          12,500
50         01/01/2007           10,500          10,500

50         05/01/2007           12,500          12,500
60         06/01/2007           11,500          11,500
70         07/01/2007           11,500          11,500
80         07/01/2007           15,500          15,500
90         06/01/2007           15,500          15,500

Let’s work out some examples of OLAP functions that are being discussed in the previous articles. Before that make sure you have a table and there is data present init. Follow the below instruction to create a temporary table and insert data into it. While inserting data you might encounter one of the following errors:

3520: A constant value in a query is not valid for column sale_month.
3535: A character string failed conversion to a numeric value.
2665: Invalid date.

All of these errors are due to the date value that you are inserting. If you encounter any of these errors, try permutation and combination of the date value and see if you can get rid of the error. If you still can not then find out from your sys admin what date format in allowed while you are inserting dates. Here in this example the format is ‘yyyy-mm-dd’. All of the below statements are tested in SQL assistant. Create a Volatile Table “Sales” as follows:

CREATE SET VOLATILE TABLE <username>.sales,
NO FALLBACK,
CHECKSUM = DEFAULT,
LOG
(
“store_no” INTEGER,
“sale_month” DATE,
“prod_code” CHAR (20) CHARACTER SET LATIN NOT CASESPECIFIC,
“projected_sale” INTEGER,
“actual_sale” INTEGER
)
PRIMARY INDEX ( “store_no” )
ON COMMIT PRESERVE ROWS;

Insert data as follows:
INSERT INTO sales VALUES(10,’2007-01-01′,’Apple’,15000,13500);
INSERT INTO sales VALUES(20,’2007-02-01′,’Banana’,10000,11000);
INSERT INTO sales VALUES(30,’2007-03-01′,’Orange’,11000,10500);
INSERT INTO sales VALUES(40,’2007-04-01′,’Apricot’,12500,12500);
INSERT INTO sales VALUES(50,’2007-05-01′,’Grapes’,15000,12500);
INSERT INTO sales VALUES(60,’2007-06-01′,’Apple’,12000,11500);
INSERT INTO sales VALUES(70,’2007-07-01′,’Banana’,15500,11500);
INSERT INTO sales VALUES(80,’2007-07-01′,’Grapes’,15000,15500);
INSERT INTO sales VALUES(90,’2007-06-01′,’Orange’,16000,15500);
INSERT INTO sales VALUES(10,’2007-05-01′,’Apricot’,14500,13500);
INSERT INTO sales VALUES(20,’2007-04-01′,’Mango’,15500,12500);
INSERT INTO sales VALUES(30,’2007-03-01′,’Mango’,15000,12500);
INSERT INTO sales VALUES(40,’2007-02-01′,’Guava’,14000,11500);
INSERT INTO sales VALUES(50,’2007-01-01′,’Guava’,12000,10500);
INSERT INTO sales VALUES(30,’2007-02-01′,’Strawberry’,12000,10000);

Now take the example of AVG () function. As you know AVG () computes the cumulative or moving average of a column, we will calculate something similar in the following scenario.
Scenario is to find out the actual sales in each store, averaged over the current month and the preceding month. Here preceding month would be understood as the previous month to the current month, now what is current month, it is the first row returned after the result set is ordered after partitioning. Following is the query.

SELECT store_no,
sale_month,
actual_sale,
projected_sale,
AVG(actual_sale) OVER (PARTITION BY store_no
          ORDER BY projected_sale ROWS 1 PRECEDING)
FROM sales;

And it would return the following result set.
store_no     sale_month    actual_sale    projected_sale Moving Avg(actual_sale)
10                5/1/2007       13,500          14,500          13,500.00
10                1/1/2007       13,500          15,000          13,500.00
20                2/1/2007       11,000          10,000          11,000.00
20                4/1/2007       12,500          15,500          11,750.00
30                3/1/2007       10,500          11,000          10,500.00
30                2/1/2007       10,000          12,000          10,250.00
30                3/1/2007       12,500          15,000          11,250.00
40                4/1/2007       12,500          12,500          12,500.00
40                2/1/2007       11,500          14,000          12,000.00
50                1/1/2007       10,500          12,000          10,500.00
50                5/1/2007       12,500          15,000          11,500.00
60                6/1/2007       11,500          12,000          11,500.00
70                7/1/2007       11,500          15,500          11,500.00
80                7/1/2007       15,500          15,000          15,500.00
90                6/1/2007       15,500          16,000          15,500.00

Note that the moving average number is changing every 2nd row in case there is a difference in between the 1st and 2nd row figure within the same store number. Particularly for the store number 30, you can see that there is a slight increase in the moving average in spite of good increase in actual sales. Projected Sales is also added just to have an idea how the actual sale is performing over the project sale.

Teradata

OLAP Functions

No comments:

Post a Comment